A currated list of useful tools for data analysis.

  • pandas_profile
  • pyviz
  • resumetable
  • feature_transform (my library)

I will explore the transformations using the wine data set from Kaggle.

from pathlib import Path


import pandas as pd
#import numpy as np
#from scipy.stats import kurtosis, skew
from scipy import stats
# import math
# import warnings
# warnings.filterwarnings("error")

from google.colab import drive
mnt=drive.mount('/content/gdrive', force_remount=True)


root_dir = "/content/gdrive/My Drive/"
base_dir = root_dir + 'redwine'
csv_path = (base_dir+'/winequality-red.csv')
df=pd.read_csv(csv_path)
Mounted at /content/gdrive
# https://gist.github.com/harperfu6/5ea565ee23aaf8461a840c480490cd9a

pd.set_option("display.max_rows", 1000)
def resumetable(df):
    print(f'Dataset Shape: {df.shape}')
    summary = pd.DataFrame(df.dtypes, columns=['dtypes'])
    summary = summary.reset_index()
    summary['Name'] = summary['index']
    summary = summary[['Name', 'dtypes']]
    summary['Missing'] = df.isnull().sum().values
    summary['Uniques'] = df.nunique().values
    summary['First Value'] = df.loc[0].values
    summary['Second Value'] = df.loc[1].values
    summary['Third Value'] = df.loc[2].values
    
    for name in summary['Name'].value_counts().index:
        summary.loc[summary['Name'] == name, 'Entropy'] = \
        round(stats.entropy(df[name].value_counts(normalize=True), base=2), 2)
    
    return summary

Typically, the first thing to do is examine the first rows of data, but this just gives you a very rudimentary feel for the data.

df = pd.read_csv(csv_path) 
df.head()
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5

I found resumetable() to be very convenient. We get a sense of cardinaltiy from Uniques and we can easily see where we are missing data.

Also, knowing the datatypes of each column is helpful when in comes to pre-processing the data.

I came across this funtion on Kaggle(I think) and found it incredibly helpful

resumetable(df)
Dataset Shape: (1599, 12)
Name dtypes Missing Uniques First Value Second Value Third Value Entropy
0 fixed acidity float64 0 96 7.4000 7.8000 7.800 5.94
1 volatile acidity float64 0 143 0.7000 0.8800 0.760 6.39
2 citric acid float64 0 80 0.0000 0.0000 0.040 5.87
3 residual sugar float64 0 91 1.9000 2.6000 2.300 4.78
4 chlorides float64 0 153 0.0760 0.0980 0.092 6.22
5 free sulfur dioxide float64 0 60 11.0000 25.0000 15.000 5.08
6 total sulfur dioxide float64 0 144 34.0000 67.0000 54.000 6.60
7 density float64 0 436 0.9978 0.9968 0.997 7.96
8 pH float64 0 89 3.5100 3.2000 3.260 5.91
9 sulphates float64 0 96 0.5600 0.6800 0.650 5.73
10 alcohol float64 0 65 9.4000 9.8000 9.800 5.19
11 quality int64 0 6 5.0000 5.0000 5.000 1.71

Another tool I use is pandas_profiling

import sys

!"{sys.executable}" -m pip install -U pandas-profiling[notebook]
!jupyter nbextension enable --py widgetsnbextension

Collecting pandas-profiling[notebook]
  Downloading https://files.pythonhosted.org/packages/dd/12/e2870750c5320116efe7bebd4ae1709cd7e35e3bc23ac8039864b05b9497/pandas_profiling-2.11.0-py2.py3-none-any.whl (243kB)
     |████████████████████████████████| 245kB 4.1MB/s 
Requirement already satisfied, skipping upgrade: jinja2>=2.11.1 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling[notebook]) (2.11.3)
Requirement already satisfied, skipping upgrade: seaborn>=0.10.1 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling[notebook]) (0.11.1)
Collecting requests>=2.24.0
  Downloading https://files.pythonhosted.org/packages/29/c1/24814557f1d22c56d50280771a17307e6bf87b70727d975fd6b2ce6b014a/requests-2.25.1-py2.py3-none-any.whl (61kB)
     |████████████████████████████████| 61kB 4.3MB/s 
Requirement already satisfied, skipping upgrade: attrs>=19.3.0 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling[notebook]) (20.3.0)
Requirement already satisfied, skipping upgrade: joblib in /usr/local/lib/python3.6/dist-packages (from pandas-profiling[notebook]) (1.0.0)
Collecting confuse>=1.0.0
  Downloading https://files.pythonhosted.org/packages/6d/55/b4726d81e5d6509fa3441f770f8a9524612627dc1b2a7d6209d1d20083fe/confuse-1.4.0-py2.py3-none-any.whl
Collecting htmlmin>=0.1.12
  Downloading https://files.pythonhosted.org/packages/b3/e7/fcd59e12169de19f0131ff2812077f964c6b960e7c09804d30a7bf2ab461/htmlmin-0.1.12.tar.gz
Collecting phik>=0.10.0
  Downloading https://files.pythonhosted.org/packages/d9/27/d4197ed93c26d9eeedb7c73c0f24462a65c617807c3140e012950c35ccf9/phik-0.11.0.tar.gz (594kB)
     |████████████████████████████████| 604kB 7.0MB/s 
Requirement already satisfied, skipping upgrade: ipywidgets>=7.5.1 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling[notebook]) (7.6.3)
Collecting tangled-up-in-unicode>=0.0.6
  Downloading https://files.pythonhosted.org/packages/4a/e2/e588ab9298d4989ce7fdb2b97d18aac878d99dbdc379a4476a09d9271b68/tangled_up_in_unicode-0.0.6-py3-none-any.whl (3.1MB)
     |████████████████████████████████| 3.1MB 13.0MB/s 
Requirement already satisfied, skipping upgrade: pandas!=1.0.0,!=1.0.1,!=1.0.2,!=1.1.0,>=0.25.3 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling[notebook]) (1.1.5)
Collecting visions[type_image_path]==0.6.0
  Downloading https://files.pythonhosted.org/packages/98/30/b1e70bc55962239c4c3c9660e892be2d8247a882135a3035c10ff7f02cde/visions-0.6.0-py3-none-any.whl (75kB)
     |████████████████████████████████| 81kB 7.9MB/s 
Requirement already satisfied, skipping upgrade: missingno>=0.4.2 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling[notebook]) (0.4.2)
Requirement already satisfied, skipping upgrade: scipy>=1.4.1 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling[notebook]) (1.4.1)
Requirement already satisfied, skipping upgrade: numpy>=1.16.0 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling[notebook]) (1.19.5)
Requirement already satisfied, skipping upgrade: matplotlib>=3.2.0 in /usr/local/lib/python3.6/dist-packages (from pandas-profiling[notebook]) (3.2.2)
Collecting tqdm>=4.48.2
  Downloading https://files.pythonhosted.org/packages/d9/13/f3f815bb73804a8af9cfbb6f084821c037109108885f46131045e8cf044e/tqdm-4.57.0-py2.py3-none-any.whl (72kB)
     |████████████████████████████████| 81kB 8.5MB/s 
Collecting jupyter-client>=6.0.0; extra == "notebook"
  Downloading https://files.pythonhosted.org/packages/83/d6/30aed7ef13ff3f359e99626c1b0a32ebbc3bf9b9d5616ec46e9e245d5fa9/jupyter_client-6.1.11-py3-none-any.whl (108kB)
     |████████████████████████████████| 112kB 53.1MB/s 
Requirement already satisfied, skipping upgrade: jupyter-core>=4.6.3; extra == "notebook" in /usr/local/lib/python3.6/dist-packages (from pandas-profiling[notebook]) (4.7.1)
Requirement already satisfied, skipping upgrade: MarkupSafe>=0.23 in /usr/local/lib/python3.6/dist-packages (from jinja2>=2.11.1->pandas-profiling[notebook]) (1.1.1)
Requirement already satisfied, skipping upgrade: chardet<5,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests>=2.24.0->pandas-profiling[notebook]) (3.0.4)
Requirement already satisfied, skipping upgrade: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests>=2.24.0->pandas-profiling[notebook]) (2020.12.5)
Requirement already satisfied, skipping upgrade: idna<3,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests>=2.24.0->pandas-profiling[notebook]) (2.10)
Requirement already satisfied, skipping upgrade: urllib3<1.27,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests>=2.24.0->pandas-profiling[notebook]) (1.24.3)
Requirement already satisfied, skipping upgrade: pyyaml in /usr/local/lib/python3.6/dist-packages (from confuse>=1.0.0->pandas-profiling[notebook]) (3.13)
Requirement already satisfied, skipping upgrade: numba>=0.38.1 in /usr/local/lib/python3.6/dist-packages (from phik>=0.10.0->pandas-profiling[notebook]) (0.51.2)
Requirement already satisfied, skipping upgrade: ipykernel>=4.5.1 in /usr/local/lib/python3.6/dist-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (4.10.1)
Requirement already satisfied, skipping upgrade: traitlets>=4.3.1 in /usr/local/lib/python3.6/dist-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (4.3.3)
Requirement already satisfied, skipping upgrade: ipython>=4.0.0; python_version >= "3.3" in /usr/local/lib/python3.6/dist-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (5.5.0)
Requirement already satisfied, skipping upgrade: nbformat>=4.2.0 in /usr/local/lib/python3.6/dist-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (5.1.2)
Requirement already satisfied, skipping upgrade: widgetsnbextension~=3.5.0 in /usr/local/lib/python3.6/dist-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (3.5.1)
Requirement already satisfied, skipping upgrade: jupyterlab-widgets>=1.0.0; python_version >= "3.6" in /usr/local/lib/python3.6/dist-packages (from ipywidgets>=7.5.1->pandas-profiling[notebook]) (1.0.0)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.7.3 in /usr/local/lib/python3.6/dist-packages (from pandas!=1.0.0,!=1.0.1,!=1.0.2,!=1.1.0,>=0.25.3->pandas-profiling[notebook]) (2.8.1)
Requirement already satisfied, skipping upgrade: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas!=1.0.0,!=1.0.1,!=1.0.2,!=1.1.0,>=0.25.3->pandas-profiling[notebook]) (2018.9)
Requirement already satisfied, skipping upgrade: networkx>=2.4 in /usr/local/lib/python3.6/dist-packages (from visions[type_image_path]==0.6.0->pandas-profiling[notebook]) (2.5)
Requirement already satisfied, skipping upgrade: Pillow; extra == "type_image_path" in /usr/local/lib/python3.6/dist-packages (from visions[type_image_path]==0.6.0->pandas-profiling[notebook]) (7.0.0)
Collecting imagehash; extra == "type_image_path"
  Downloading https://files.pythonhosted.org/packages/8e/18/9dbb772b5ef73a3069c66bb5bf29b9fb4dd57af0d5790c781c3f559bcca6/ImageHash-4.2.0-py2.py3-none-any.whl (295kB)
     |████████████████████████████████| 296kB 38.2MB/s 
Requirement already satisfied, skipping upgrade: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.2.0->pandas-profiling[notebook]) (1.3.1)
Requirement already satisfied, skipping upgrade: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.2.0->pandas-profiling[notebook]) (0.10.0)
Requirement already satisfied, skipping upgrade: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.2.0->pandas-profiling[notebook]) (2.4.7)
Requirement already satisfied, skipping upgrade: tornado>=4.1 in /usr/local/lib/python3.6/dist-packages (from jupyter-client>=6.0.0; extra == "notebook"->pandas-profiling[notebook]) (5.1.1)
Requirement already satisfied, skipping upgrade: pyzmq>=13 in /usr/local/lib/python3.6/dist-packages (from jupyter-client>=6.0.0; extra == "notebook"->pandas-profiling[notebook]) (22.0.2)
Requirement already satisfied, skipping upgrade: llvmlite<0.35,>=0.34.0.dev0 in /usr/local/lib/python3.6/dist-packages (from numba>=0.38.1->phik>=0.10.0->pandas-profiling[notebook]) (0.34.0)
Requirement already satisfied, skipping upgrade: setuptools in /usr/local/lib/python3.6/dist-packages (from numba>=0.38.1->phik>=0.10.0->pandas-profiling[notebook]) (53.0.0)
Requirement already satisfied, skipping upgrade: six in /usr/local/lib/python3.6/dist-packages (from traitlets>=4.3.1->ipywidgets>=7.5.1->pandas-profiling[notebook]) (1.15.0)
Requirement already satisfied, skipping upgrade: decorator in /usr/local/lib/python3.6/dist-packages (from traitlets>=4.3.1->ipywidgets>=7.5.1->pandas-profiling[notebook]) (4.4.2)
Requirement already satisfied, skipping upgrade: ipython-genutils in /usr/local/lib/python3.6/dist-packages (from traitlets>=4.3.1->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.2.0)
Requirement already satisfied, skipping upgrade: pickleshare in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.7.5)
Requirement already satisfied, skipping upgrade: pygments in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (2.6.1)
Requirement already satisfied, skipping upgrade: prompt-toolkit<2.0.0,>=1.0.4 in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (1.0.18)
Requirement already satisfied, skipping upgrade: pexpect; sys_platform != "win32" in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (4.8.0)
Requirement already satisfied, skipping upgrade: simplegeneric>0.8 in /usr/local/lib/python3.6/dist-packages (from ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.8.1)
Requirement already satisfied, skipping upgrade: jsonschema!=2.5.0,>=2.4 in /usr/local/lib/python3.6/dist-packages (from nbformat>=4.2.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (2.6.0)
Requirement already satisfied, skipping upgrade: notebook>=4.4.1 in /usr/local/lib/python3.6/dist-packages (from widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (5.3.1)
Requirement already satisfied, skipping upgrade: PyWavelets in /usr/local/lib/python3.6/dist-packages (from imagehash; extra == "type_image_path"->visions[type_image_path]==0.6.0->pandas-profiling[notebook]) (1.1.1)
Requirement already satisfied, skipping upgrade: wcwidth in /usr/local/lib/python3.6/dist-packages (from prompt-toolkit<2.0.0,>=1.0.4->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.2.5)
Requirement already satisfied, skipping upgrade: ptyprocess>=0.5 in /usr/local/lib/python3.6/dist-packages (from pexpect; sys_platform != "win32"->ipython>=4.0.0; python_version >= "3.3"->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.7.0)
Requirement already satisfied, skipping upgrade: Send2Trash in /usr/local/lib/python3.6/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (1.5.0)
Requirement already satisfied, skipping upgrade: nbconvert in /usr/local/lib/python3.6/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (5.6.1)
Requirement already satisfied, skipping upgrade: terminado>=0.8.1 in /usr/local/lib/python3.6/dist-packages (from notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.9.2)
Requirement already satisfied, skipping upgrade: testpath in /usr/local/lib/python3.6/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.4.4)
Requirement already satisfied, skipping upgrade: entrypoints>=0.2.2 in /usr/local/lib/python3.6/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.3)
Requirement already satisfied, skipping upgrade: pandocfilters>=1.4.1 in /usr/local/lib/python3.6/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (1.4.3)
Requirement already satisfied, skipping upgrade: bleach in /usr/local/lib/python3.6/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (3.3.0)
Requirement already satisfied, skipping upgrade: mistune<2,>=0.8.1 in /usr/local/lib/python3.6/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.8.4)
Requirement already satisfied, skipping upgrade: defusedxml in /usr/local/lib/python3.6/dist-packages (from nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.6.0)
Requirement already satisfied, skipping upgrade: packaging in /usr/local/lib/python3.6/dist-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (20.9)
Requirement already satisfied, skipping upgrade: webencodings in /usr/local/lib/python3.6/dist-packages (from bleach->nbconvert->notebook>=4.4.1->widgetsnbextension~=3.5.0->ipywidgets>=7.5.1->pandas-profiling[notebook]) (0.5.1)
Building wheels for collected packages: htmlmin, phik
  Building wheel for htmlmin (setup.py) ... done
  Created wheel for htmlmin: filename=htmlmin-0.1.12-cp36-none-any.whl size=27085 sha256=f607899a091f969cab78264a9dd354c4fdf0f612ae8edc9f51ac6e1f9c072fde
  Stored in directory: /root/.cache/pip/wheels/43/07/ac/7c5a9d708d65247ac1f94066cf1db075540b85716c30255459
  Building wheel for phik (setup.py) ... done
  Created wheel for phik: filename=phik-0.11.0-cp36-none-any.whl size=599738 sha256=b279f73bbcc93614edd2d58b807c063fca3c2c3eddac96f1cb07914a67cfde86
  Stored in directory: /root/.cache/pip/wheels/af/54/11/aba77f21075918de02f7964eabfe8c10d5542df9e6ad10b225
Successfully built htmlmin phik
ERROR: google-colab 1.0.0 has requirement requests~=2.23.0, but you'll have requests 2.25.1 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
Installing collected packages: requests, confuse, htmlmin, phik, tangled-up-in-unicode, imagehash, visions, tqdm, jupyter-client, pandas-profiling
  Found existing installation: requests 2.23.0
    Uninstalling requests-2.23.0:
      Successfully uninstalled requests-2.23.0
  Found existing installation: tqdm 4.41.1
    Uninstalling tqdm-4.41.1:
      Successfully uninstalled tqdm-4.41.1
  Found existing installation: jupyter-client 5.3.5
    Uninstalling jupyter-client-5.3.5:
      Successfully uninstalled jupyter-client-5.3.5
  Found existing installation: pandas-profiling 1.4.1
    Uninstalling pandas-profiling-1.4.1:
      Successfully uninstalled pandas-profiling-1.4.1
Successfully installed confuse-1.4.0 htmlmin-0.1.12 imagehash-4.2.0 jupyter-client-6.1.11 pandas-profiling-2.11.0 phik-0.11.0 requests-2.25.1 tangled-up-in-unicode-0.0.6 tqdm-4.57.0 visions-0.6.0
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK
from ipywidgets import widgets

# Our package
from pandas_profiling import ProfileReport
from pandas_profiling.utils.cache import cache_file
profile = ProfileReport(df, title="red wine", html={"style": {"full_width": True}}, sort="None")

Takes a couple minutes to process and display the results even with a small dataset.

You do get some richer analysis like Correlation plots and distributions of the variables.

profile.to_widgets()
/usr/local/lib/python3.6/dist-packages/pandas_profiling/profile_report.py:424: UserWarning: Ipywidgets is not yet fully supported on Google Colab (https://github.com/googlecolab/colabtools/issues/60).As an alternative, you can use the HTML report. See the documentation for more information.
  "Ipywidgets is not yet fully supported on Google Colab (https://github.com/googlecolab/colabtools/issues/60)."

An alternative is Sweetviz. I tend to like this a bit better for its display of distributions. In general, it also loads a bit quicker.

!pip -q install sweetviz
     |████████████████████████████████| 15.1MB 331kB/s 
import sweetviz as sv
sweet_report = sv.analyze(df)
sweet_report.show_notebook(w=1200.)

My tool for numerical transformations

I wanted a simple way to view the distributions of the features and more importantly a way to view the data after numerical transformations such as Box-Cox or a log transform.

The following plot is a sample of what I developed.

Numerical transfors

See this Post for more detail.